Rust fungi are a group of fungal pathogens that cause some of the world's most
destructive diseases of trees and crops. A shared characteristic among rust fungi is 
obligate biotrophy, the inability to complete a lifecycle without a host. This dependence 
on a host species likely affects patterns of gene expansion, contraction, and innovation 
within rust pathogen genomes. The establishment of disease by biotrophic pathogens 
is reliant upon effector proteins that are encoded in the fungal genome and secreted 
from the pathogen into the host apoplast or into host cells. This study uses
a comparative genomic approach to elucidate putative effectors and determine their 
evolutionary histories. We used OrthoMCL to identify nearly 20,000 gene families in 
proteomes of 16 diverse fungal species, which include 15 basidiomycetes and one 
ascomycete. We inferred patterns of duplication and loss for each gene family and 
identified families with distinctive patterns of expansion/contraction associated with the 
evolution of rust fungal genomes. To recognize potential contributors to the unique
features of rust pathogens, we identified families harboring secreted proteins that: (i) arose 
or expanded in rust pathogens relative to other fungi, or (ii) contracted or were lost in rust 
fungal genomes. While the origin of rust fungi appears to be associated with considerable 
gene loss, there are many gene duplications associated with each sampled rust fungal 
genome. We also highlight two putative effector gene families that have expanded in Cqf 
that we hypothesize have roles in pathogenicity. 
INTRODUCTION

Rust fungi are plant-infecting filamentous fungi in the order
Pucciniales (Basidiomycota) that are unified by obligate biotro- 
phy (Voegele and Mendgen, 2011). This form of pathogenicity 
requires a live host to establish a parasitic relationship. This is 
accomplished through the establishment of a molecularly inti- 
mate interaction at the host-pathogen interface characterized by 
the secretion of an arsenal of proteins from the pathogen that 
suppress host defense mechanisms and promote the acquisi- 
tion of essential nutrients by the pathogen (Dodds et al., 2009; 
Stergiopoulos and de Wit, 2009). Such proteins, termed effectors, 
are thought to establish and maintain a compatible interaction 
between the pathogen and host. The processes that drive the evolution of
effector diversity are of great interest because pathogen effector genes
and host resistance genes are the interacting "gene-for-gene" pairs that
drive coevolution in these pathosystems
(Jones and Dangl, 2006; Stergiopoulos and de Wit, 2009). 



Secreted proteins can be identified from whole genome 
sequences through the utilization of bioinformatic tools to iso- 
late proteins with N-terminal secretion signals. Bioinformatic 
pipelines can then be used to narrow predicted secreted pro- 
tein sets to putative effectors. These proteins contain features of 
known effectors such as elevated cysteine content (greater than 
2%), that would enable the formation of stabilizing disulfide 
bridges (Stergiopoulos and de Wit, 2009), and protein domains 
associated with pathogenicity. Length is a criterion used to identify
small secreted proteins (SSPs) from within putative effector
protein sets, as SSPs are effector-like proteins with lengths less 
than 300 amino acids. Sequence comparisons alone do not pro- 
vide a reliable means to identify putative effectors since some 
known effectors are lineage-specific while others are conserved 
across taxa (Rep, 2005; Saunders et al., 2012; Giraldo and Valent, 
2013). Candidate effectors, a further distinction, are putative 
effectors that have additional support for roles in pathogenicity 
(i.e., induced transcription or elevated expression in planta).
Genetic evidence for functional redundancy of effectors, pre- 
sumably due to multigene families of effector proteins, whose 
members share similar functions, has been reported in several 
pathogens (Kamper et al., 2006; Rafiqi et al., 2012; Saitoh et al.,
2012; Giraldo and Valent, 2013). This suggests it would be useful 
to characterize families of proteins with effector-like characteris- 
tics so as to identify families that have expanded during evolution 
in association with the acquisition of pathogenic life history char- 
acteristics. Examining the evolutionary history of protein families 
across a set of diverse fungal taxa should help identify lineage- 
specific, putative effector protein families, families that may have 
evolved similar functions in more distantly related taxa, and 
families that may exhibit functional redundancy. 
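
As an illustration of the structural criteria described above, the following minimal Python sketch flags a protein as small (fewer than 300 amino acids) and cysteine-rich (cysteine content greater than 2%). The function names and default thresholds are hypothetical wrappers around the values quoted in the text and are not the pipeline used in this study; secretion is assumed to have been predicted separately.

```python
def cysteine_fraction(sequence: str) -> float:
    """Return the fraction of residues that are cysteine (one-letter code 'C')."""
    seq = sequence.strip().upper()
    if not seq:
        return 0.0
    return seq.count("C") / len(seq)


def is_small(sequence: str, max_length: int = 300) -> bool:
    """True if the protein is shorter than max_length residues (SSP length criterion)."""
    length = len(sequence.strip())
    return 0 < length < max_length


def is_cysteine_rich(sequence: str, min_fraction: float = 0.02) -> bool:
    """True if cysteine content exceeds min_fraction (2% by default)."""
    return cysteine_fraction(sequence) > min_fraction


def is_effector_like(sequence: str) -> bool:
    """Combine both structural criteria for a protein already predicted as secreted."""
    return is_small(sequence) and is_cysteine_rich(sequence)
```

Such a filter narrows a predicted secretome on sequence composition alone; as noted above, it does not replace evidence from expression or family-level conservation.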

Cronartium quercuum f. sp. fusiforme (Cqf) is a rust pathogen 
that has a complex life cycle with five spore types and exhibits 
alternation between two hosts, oak (Quercus spp.) and southern 
pines (Pinus spp.). The fungus incites fusiform rust disease on 
southern pines, leading to significant economic losses to the forest 
products industry. The impact of the disease on pine produc- 
tion has motivated extensive research on the genetic interaction 
between Cqf and pine. The objective of this study is to identify 
putative effector gene families in the Cqf genome through com- 
parative genomic analyses between Cqf and 15 other fungal taxa, 
including two other rust pathogens. We have identified families 
that have expanded in Cqf that we hypothesize are involved in 
conditioning stem gall phenotypes observed on the pine host. Our 
analyses provide a more thorough perspective on Cqf and rust 
pathogen evolution and also highlight the evolutionary patterns 
of putative effector families that Cqf employs to establish disease 
on two taxonomically diverse host species. 

MATERIALS AND METHODS 
GENE FAMILY CONSTRUCTION 

Complete proteomes were downloaded from the public databases 
of the National Center for Biotechnology Information
(www.ncbi.nlm.nih.gov/genome), U.S. Department of Energy's
Joint Genome Institute (jgi.doe.gov/fungi), and the Broad 
Institute (www.broadinstitute.org). Sixteen proteomes were 
obtained: (Basidiomycota) Cronartium quercuum f.sp. fusiforme
G11 version 1.0 (Cqf; unpublished, jgi.doe.gov/Cronartium),
Melampsora larici-populina version 1.0 (Mlp; Duplessis et al.,
2011a,b), Puccinia graminis f.sp. tritici CRL 75-36-700-3 race
SCCL (Pgt; Duplessis et al., 2011a,b), Mixia osmundae IAM
14324 version 1.0 (Mos; Toome et al., 2014), Sporobolomyces
roseus version 1.0 (Sro; with permission; jgi.doe.gov/fungi),
Rhodotorula graminis strain WP1 version 1.1 (Rgr; with
permission; jgi.doe.gov/fungi), Ustilago maydis strain 521 (Uma;
Kamper et al., 2006), Malassezia globosa CBS 7966 (Mgl; Xu
et al., 2007), Pisolithus tinctorius Marx 270 version 1.0 (Pti; with
permission; jgi.doe.gov/fungi), Phanerochaete chrysosporium
version 2.0 (Pch; Martinez et al., 2004), Heterobasidion irregulare
version 2.0 (Hir; Olson et al., 2012), Serpula lacrymans S7.3
version 2.0 (Sla; Eastwood et al., 2011), Agaricus bisporus var.
bisporus H97 version 2.0 (Abi; Morin et al., 2012), Laccaria
bicolor version 2.0 (Lbi; Martin et al., 2008), Amanita muscaria
Koide version 1.0 (Amu; with permission; jgi.doe.gov/fungi),
and (Ascomycota) Saccharomyces cerevisiae S288C (Sce; Goffeau
et al., 1996), for a total of 200,313 proteins. Gene families were
delineated by OrthoMCL v.5.0 software (Li et al., 2003) using
default parameters (e-value cutoff of 1e-05, minimum
similarity of 50%).

SECRETOME PREDICTION

The collective set of secreted proteins, or the secretome, of Cqf
was identified bioinformatically. Secreted proteins were annotated
using signal peptide (SignalP 3.0 and 4.0; Bendtsen
et al., 2004; Petersen et al., 2011), protein localization (TargetP
1.1; Emanuelsson et al., 2000), and transmembrane domain
(TMHMM 2.0; Krogh et al., 2001) prediction software
(Feau et al. in prep.). Proteins predicted by TargetP 1.1 to
be targeted to the mitochondrion (with RC values between 1
and 3) were discarded, and the remaining proteins were submitted to
TMHMM 2.0. If no TM-domain was identified in the protein, or a
TM-domain was predicted in the N-terminal region of the protein
(i.e., in the first 70 amino acids), the protein was passed to
SignalP 4.0; in any other case, the protein was discarded. SignalP
4.0 implements either the SignalP-TM network, to discriminate
between a true signal peptide and an N-terminal transmembrane
region, or the SignalP-noTM network, if the program does not
identify a TM-like domain in the N-terminal region of the protein.
In this last case (i.e., if the SignalP-noTM network was
implemented by SignalP 4.0), the protein was passed to
SignalP 3.0, and a signal peptide prediction was considered positive
if either both the NN and HMM methods converged on a positive result
or the NN method returned a D-score > 0.5.
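
The decision steps above can be summarized in a short Python sketch. It operates on predictions that have already been parsed from the output of each program; the `Predictions` record and its field names are hypothetical, and no program is called. Two points are assumptions not stated in the text: a protein is treated as having an N-terminal TM-domain only when every predicted helix ends within the first 70 residues, and when SignalP 4.0 uses the SignalP-TM network its own call is taken as final.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Predictions:
    """Pre-parsed predictions for one protein (hypothetical field names)."""
    targetp_location: str               # assumed code: "M" = mitochondrion
    targetp_rc: int                     # TargetP reliability class
    tm_helices: List[Tuple[int, int]]   # (start, end) of predicted TM helices, 1-based
    sp4_network: str                    # "TM" or "noTM", network used by SignalP 4.0
    sp4_positive: bool                  # SignalP 4.0 signal peptide call
    sp3_nn_positive: bool               # SignalP 3.0 neural network call
    sp3_hmm_positive: bool              # SignalP 3.0 hidden Markov model call
    sp3_nn_dscore: float                # SignalP 3.0 NN D-score


def is_predicted_secreted(p: Predictions, n_term_window: int = 70) -> bool:
    # Step 1: discard proteins targeted to the mitochondrion with RC 1 to 3.
    if p.targetp_location == "M" and 1 <= p.targetp_rc <= 3:
        return False
    # Step 2: keep proteins with no TM helix, or only N-terminal helices
    # (assumption: every helix must end within the first 70 residues).
    if any(end > n_term_window for _, end in p.tm_helices):
        return False
    # Step 3: if SignalP 4.0 used the TM network, accept its call (assumption).
    if p.sp4_network == "TM":
        return p.sp4_positive
    # Step 4: otherwise apply the SignalP 3.0 rule: NN and HMM agree,
    # or the NN D-score exceeds 0.5.
    return (p.sp3_nn_positive and p.sp3_hmm_positive) or p.sp3_nn_dscore > 0.5
```

Keeping the rule as a pure function of parsed predictions makes it straightforward to re-run the classification if any threshold, such as the 70-residue window, is changed.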

ESTIMATION OF GENE TREES 

The protein sequences from each gene family were aligned using 
MUSCLE (Edgar, 2004). We assembled a collection of amino 
acid alignments from gene families with at least four sequences. 
For each of the gene family alignments, we performed a maxi- 
mum likelihood (ML) search to find the optimal topology using 
RAxML v.7.2.8 with the PROTCATJTT model (Stamatakis et al.,
2005). Gene tree estimates often contain considerable error and can be
improved with knowledge of the underlying species tree (e.g., 
Rasmussen and Kellis, 2011). We constructed a species tree from 
a phylogenetic matrix of 2404 single copy genes with sequences 
from at least eight fungal taxa. We performed a ML search using 
RAxML v.7.2.8 with the PROTCATJTT model on the concate- 
nated single gene matrix to estimate the species tree. For each 
of the gene trees, we used TreeFix version 1.1.8 (Wu et al., 2013) 
to improve on the ML topology given the species tree. TreeFix 
searches for a statistically equivalent rooted gene tree topology 
that minimizes the number of duplications and losses implied by 
the species tree. For 10 of the gene families, the TreeFix runs did 
not complete in 1 week. For these gene trees, we rooted the ML 
tree with a root that minimizes the number of implied duplica- 
tions and losses using the program OptRoot (www.wehe.us). For 
all of the gene trees output from TreeFix or OptRoot, the loca- 
tions of the implied duplications and losses were mapped on the 
species tree using URec version 1.02 (Gorecki and Tiuryn, 2007). 
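
The two family-selection rules used above (at least four sequences for gene tree estimation; single-copy genes present in at least eight taxa for the species tree matrix) can be sketched as follows. This is a minimal illustration, assuming gene families are available as a mapping from a family identifier to a list of (taxon, protein identifier) pairs; the function names and data layout are hypothetical and do not correspond to OrthoMCL output parsing.

```python
from collections import Counter
from typing import Dict, List, Tuple

Families = Dict[str, List[Tuple[str, str]]]  # family id -> [(taxon, protein id), ...]


def families_for_gene_trees(families: Families, min_sequences: int = 4) -> List[str]:
    """Families large enough to align and to estimate a gene tree."""
    return sorted(fam for fam, members in families.items()
                  if len(members) >= min_sequences)


def single_copy_families(families: Families, min_taxa: int = 8) -> List[str]:
    """Families with exactly one sequence per taxon, present in at least min_taxa taxa."""
    selected = []
    for fam, members in families.items():
        copies_per_taxon = Counter(taxon for taxon, _ in members)
        if (len(copies_per_taxon) >= min_taxa
                and all(n == 1 for n in copies_per_taxon.values())):
            selected.append(fam)
    return sorted(selected)
```

Families returned by `single_copy_families` would then be aligned individually and concatenated into the matrix used for the species tree search.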

FUNCTIONAL ANNOTATION OF PROTEINS 

Functional annotations were obtained from the Joint Genome 
Institute's (JGI) Mycocosm (jgi.doe.gov/fungi; Grigoriev et al., 
2014) for the 16 organisms included in the phylogenetic
and gene family analyses. Protein domains were identified 
using the online InterPro interface (http://www.ebi.ac.uk/interpro/;
Hunter et al., 2012). Transmembrane domain regions
were identified in amino acid sequences of proteins using 
TMpred (www.ch.embnet.org/software/TMPRED_form.html; 
Hofmann, 1993). Glycosylphosphatidylinositol (GPI) anchor 
sites were predicted using big-PI Predictor
(http://mendel.imp.ac.at/gpi/gpi_server.html; Eisenhaber et al., 1999).

RESULTS 

GENE FAMILY ANALYSIS 

The OrthoMCL analysis of the proteomes identified 19,489 gene 
families that contained 152,964 proteins. This protein count 
was ~76% (152,964/200,313) of the total proteins input into 
OrthoMCL analysis. Protein counts per gene family ranged 
from 2 (minimum size for a gene family) to 343 proteins, and 
the average family size was 7.8 proteins. Approximately 42% 
of the gene families had proteins encoded from only a sin- 
gle taxon, and families with proteins encoded in two or three 
taxa were the next most abundant families (Figure 1). Relatively 
few families contained proteins detected in 4-14 taxa, but more 
families contained proteins detected in 15 or 16 taxa (~12% 
of all families; 2,277/19,489). The families broadly conserved 
across all 16 sampled taxa are likely to contain core essential 
fungal proteins. The remaining ~24% of input proteins that
did not group into families are considered true singletons, as
they lack homologs within their own proteome or in the other
taxa.
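
The summary values reported above follow directly from the protein and family counts. The short Python sketch below recomputes them and shows how the taxa-count distribution in Figure 1 could be tabulated; the family data layout (family identifier mapped to a list of (taxon, protein identifier) pairs) and the function names are assumptions made for illustration.

```python
from collections import Counter


def taxa_count_histogram(families):
    """Map 'number of taxa represented in a family' -> number of such families."""
    histogram = Counter()
    for members in families.values():
        histogram[len({taxon for taxon, _ in members})] += 1
    return histogram


def percent(part, whole):
    return 100.0 * part / whole


# Counts taken from the text.
TOTAL_INPUT_PROTEINS = 200_313
PROTEINS_IN_FAMILIES = 152_964
N_FAMILIES = 19_489
FAMILIES_IN_15_OR_16_TAXA = 2_277

grouped_pct = percent(PROTEINS_IN_FAMILIES, TOTAL_INPUT_PROTEINS)
singleton_pct = percent(TOTAL_INPUT_PROTEINS - PROTEINS_IN_FAMILIES, TOTAL_INPUT_PROTEINS)
mean_family_size = PROTEINS_IN_FAMILIES / N_FAMILIES
broad_pct = percent(FAMILIES_IN_15_OR_16_TAXA, N_FAMILIES)

print(f"grouped: {grouped_pct:.1f}%  singletons: {singleton_pct:.1f}%")
print(f"mean family size: {mean_family_size:.1f}  in 15-16 taxa: {broad_pct:.1f}%")
```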

To highlight gene families specific to the rust pathogen lineage, 
we compared gene family conservation between four pathogen 
genomes belonging to the subphylum Pucciniomycotina, which 
include three rust pathogens (Cqf, Mlp, and Pgt; Pucciniales) 
and a non-rust fern pathogen, Mixia osmundae (Mos; Mixiales). 
We selected the 4673 gene families containing proteins from at 
least one of these four pathogens (and no proteins from other 
sampled taxa) from the complete OrthoMCL family dataset. 
These families contained 22,784 proteins and exhibited vary- 
ing patterns of conservation across the four taxa (Figure 2A). 
Most prominently, 14,978 of the 22,784 proteins (65.7%) were 
encoded in only one of the four pathogen genomes, illustrating 
high levels of species specificity (Figure 2A). Fewer proteins were 
shared between two or more rust fungi in this subset of families 
(7512/22,784 or 33.0%) (Figure 2A). Of the 19,489 families determined
by OrthoMCL, 656 families (or 3.4%; Figure 2B) consisted
of gene models found only in the three rust pathogen genomes, 
where each of the three rust pathogens had a representative gene 
model in the family. A total of 3466 proteins (Figure 2A) were 
ascribed to these rust pathogen-specific families. The largest fam- 
ily contained 249 proteins, and the smallest had 3 proteins. These 
656 families represent the "core" rust pathogen protein set. Of 
the sampled genomes, the two pathogens with the most uniquely 
shared families are Cqf and Mlp, which have 878 conserved 
families. 





FIGURE 1 | Gene families are predominantly species-specific in the sampled taxa. The proportions of gene families with proteins encoded in one through
16 fungal taxa genomes (taxa count) are displayed for the 19,489 OrthoMCL gene families. 
FIGURE 2 | Conserved proteins and families within only four
Pucciniomycete pathogen genomes are mostly species-specific. Gene
family (OrthoMCL) conservation within Pucciniomycete pathogens: Mixia
osmundae (Mos), Cronartium quercuum f.sp. fusiforme (Cqf), Melampsora
larici-populina (Mlp), and Puccinia graminis f.sp. tritici (Pgt). The values
indicate the total number of (A) gene models or (B) gene families conserved
in only these four species and absent in the remaining 12 fungal taxa
included in the OrthoMCL analysis.



IDENTIFYING PUTATIVE EFFECTORS 

We identified gene families encoding putative effectors in the Cqf 
genome. To highlight putative effectors, the predicted secretome 
(predicted secreted proteins; see Methods) was analyzed for cys- 
teine content and family-level conservation. The Cqf secretome 
harbors 666 SSPs, which are secreted proteins with fewer than 
300 amino acids (aa). The range in protein lengths within the 
secretome was 51-1716 aa with a median length of 249 aa. 

Analysis of Cqf gene families elucidated the evolutionary his- 
tories of secreted putative effectors. To identify putative effector 
families within the Cqf genome, we selected gene families with 
at least two secreted proteins, as these families would then con- 
tain at least two paralogous putative effectors and the family 
would have therefore expanded in the Cqf genome. In total, 132 
putative effector families were identified. Sixty-five of these fam- 
ilies were conserved effector families, with proteins from two or 
more fungal taxa. These families had sequences from 6.94 taxa 
on average (Table 1) and represent potential effectors with functions
that can occur in a wide range of hosts. In contrast, 67
novel effector families were considered to be evolutionary innovations
since the family members consisted of only Cqf proteins
(Table 2). The average family size for conserved effector families
(18.23 proteins) was significantly larger than that of Cqf-specific
families (3.54 proteins; t-test, p-value < 0.001). However, there was
no difference in the number of Cqf proteins per family in conserved
(mean = 5.02 proteins) and Cqf-specific families (mean
= 2.4 proteins). Families where all Cqf protein members are predicted
to be secreted were found in both putative effector family
types and at proportions that were not significantly different from
one another (conserved families = 40/65, Cqf-specific = 44/67;
Tables 1, 2). Evidence for potential sub- and/or neofunctionalization
was observed in 23 of the 67 (34.3%) Cqf-specific putative
effector families, as only a subset of proteins within these families
received secretion predictions, suggesting distinct biological roles
among family members.
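
The selection and classification of putative effector families described above can be expressed as a brief Python sketch. It assumes gene families are stored as a mapping from a family identifier to a list of (taxon, protein identifier) pairs and that the Cqf secretome is a set of protein identifiers; these structures, the function name, and the output keys are hypothetical.

```python
def classify_effector_families(families, secretome, focal="Cqf", min_secreted=2):
    """Select families with at least min_secreted secreted focal proteins and
    label each as conserved (two or more taxa) or specific (focal taxon only)."""
    result = {}
    for fam_id, members in families.items():
        focal_ids = [pid for taxon, pid in members if taxon == focal]
        secreted = [pid for pid in focal_ids if pid in secretome]
        if len(secreted) < min_secreted:
            continue  # not a putative effector family under this criterion
        taxa = {taxon for taxon, _ in members}
        result[fam_id] = {
            "category": "conserved" if len(taxa) >= 2 else "specific",
            "n_focal": len(focal_ids),
            "n_secreted": len(secreted),
            # False marks families where only a subset is predicted secreted,
            # the pattern interpreted above as possible sub- or neofunctionalization.
            "all_secreted": len(secreted) == len(focal_ids),
        }
    return result
```

Counting the entries with `category == "specific"` and `all_secreted == False` would correspond to the 23 of 67 Cqf-specific families discussed above, given the real family and secretome data.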

GENE GAINS AND LOSSES 

Gene gain and loss were quantified across all 16 sampled fungal
taxa. We mapped the gene trees from gene families with at least
four proteins onto a species tree to determine the patterns of
duplication and loss across the 16 fungal taxa. In total, we examined
10,371 gene trees containing 131,863 protein sequences.
These gene trees implied a minimum of 49,539 duplications (i.e.,
gene family gains) and 21,789 losses (i.e., gene family contractions
and/or entire family loss). Over 93.9% of the duplications
and 67.9% of the losses are species-specific, occurring in a single
lineage at the tips of the species tree (Figure 3). The number
of species-specific duplications was positively correlated with the
size of a taxon's proteome (R² = 0.93), suggesting that gene
duplication is a mechanism for proteomic expansion and diversification
for the selected fungal taxa (Figure 4). There was no
obvious relationship between proteome size and species-specific
duplication with life history forms (i.e., symbiotic, pathogenic, or
free-living) (Figure 4). Species-specific losses were not correlated
with the proteome size, but the rust pathogen lineage exhibited
fewer losses than other sampled taxa (Figure 5).

We identified genes that were gained and lost specifically in the
rust pathogen clade. There were many gene losses (1217 events
within 1148 families) associated with the origin of the rust pathogen
clade within Pucciniomycotina (Cqf, Mlp, and Pgt) compared to the
number of gains (248 events, 142 families) (site R in Figure 3).
The number of taxa represented in these 1148 families ranges from
2 to 16 species, with the largest proportion of families (10.6%)
having representatives from all 16 taxa in the analysis (Figure 6).
Families that lost genes at the origin of the rust fungi appear to occur
in fewer of the sampled fungal lineages than those that had duplications
in the rust fungi. Fifty percent of duplicated families contain
proteins from 14 or more sampled taxa (Figure 6). Though the
disproportionate level of gene losses prior to the common ancestor
of rust pathogens is striking, each of the three rust fungal
species shows evidence of high species-specific rates of duplication
(Figure 3). In fact, 32.1% of all the duplications across the
tree are specific to only one of the rust species (Figure 3).

The proteome of the Cqf rust pathogen is enriched for novel
proteins whose expansion has presumably contributed to specialization
in its pathosystem. Numerous species-specific duplications
have occurred within the Cqf lineage (2730
duplication events in 549 families; Figure 3). Of the 549 families
Table 1 | Conserved putative effector families have broad and narrow
taxonomic distributions.
Gene families with at least two Cqf predicted secreted proteins are listed.
Data are ranked by the number of Cqf secreted proteins. The total number of
proteins in each family is provided as well as the number of proteins belonging to
the Cqf secretome (i.e., predicted secreted proteins). Asterisks adjacent to total
protein counts indicate families where all members are Cqf secretome members.
If no asterisk is present, only a portion of the family received secretion
predictions. Family 5485 (bold) will be detailed later in the article.

that have undergone Cqf-specific duplications, 248 (or 45.17%)
contain proteins not observed in any other analyzed fungal taxa. 
These 248 novel families comprise 14.5% of the annotated Cqf 
proteome (2017/13,903 proteins), highlighting the rapid expansion
of novel, likely pathogenicity-related gene families. The vast
majority (98.8%) of these novel families do not have BLASTp hits
in the NCBI non-redundant database or have hits to unknown
proteins (e-value cutoff of 1e-10; Table 3) and 94.5% do not
contain InterPro domains (unpublished, jgi.doe.gov/Cronartium; 
Hunter et al., 2012). Since the families that are unique to the Cqf
lineage are largely uncharacterized, they are consistent with expectations
for putative pathogenicity factors or effectors. Nearly 12%
(234/2,017) of proteins encoded in the 248 novel Cqf families
are members of the predicted Cqf secretome. This is significantly
greater than the ~8% of the entire Cqf proteome that also belongs
to the secretome (Chi-square = 25.418, p-value = 0.0001). The
protein characteristics of these secreted proteins are effector-like, 
as the average cysteine content is 2.2% and the median protein 
length is 272 amino acids. 
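
For readers who wish to explore this type of enrichment comparison, the following Python sketch computes a one-degree-of-freedom chi-square goodness-of-fit statistic for an observed count against an expected proportion. It is an illustration only: the exact test design and the precise proteome-wide secretome proportion used in this study are not given here, so the value 0.08 below is the approximate figure from the text and the resulting statistic is not expected to reproduce the reported value. The function name is hypothetical.

```python
import math


def chi_square_gof(observed_hits: int, n: int, expected_proportion: float):
    """One-degree-of-freedom chi-square goodness-of-fit test.

    Compares observed_hits out of n against an expected proportion and
    returns (statistic, p_value).
    """
    if not 0.0 < expected_proportion < 1.0:
        raise ValueError("expected_proportion must lie strictly between 0 and 1")
    expected_hits = n * expected_proportion
    expected_misses = n - expected_hits
    observed_misses = n - observed_hits
    statistic = ((observed_hits - expected_hits) ** 2 / expected_hits
                 + (observed_misses - expected_misses) ** 2 / expected_misses)
    # For one degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# 234 secreted proteins among 2017 proteins in novel families,
# against an assumed (approximate) background proportion of 0.08.
statistic, p_value = chi_square_gof(234, 2017, 0.08)
print(f"chi-square = {statistic:.2f}, p = {p_value:.2e}")
```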

The families that were duplicated in the Cqf lineage and 
contain sequences from other taxa exhibit patterns of conserva- 
tion that differ from the families duplicated or depleted in rust 
Table 2 | Potential sub- and neo-functionalization within Cqf-specific
putative effector families.



Gene family ID    Cqf proteins in family    Proteins in Cqf secretome within family
16,052            2*                        2
16,078            2*                        2
16,079            2*                        2
[illegible]
16,081
[illegible]
16,146            2*                        2
16,160            2*                        2
16,163            2*                        2
16,191            2*                        2
Total             237 proteins              164 proteins
Average           3.54 Cqf proteins         2.4 secreted proteins
                  per family                per family



Cqf-specific gene families with at least two predicted secreted proteins are
listed. The total number of proteins in each Cqf-specific family is provided, as
well as the number of proteins belonging to the Cqf secretome (i.e., predicted
secreted proteins). Asterisks adjacent to total protein counts indicate
families where all members are Cqf secretome members. If no asterisk is
present, only a portion of the family received secretion predictions. Family 9417
(bold) will be detailed later in the article.

pathogens (site R, Figure 3). Instead, these Cqf-specific duplicated
families (n = 549 families) are predominantly conserved in
not only Cqf, but also 2-3 taxa (Figure 6).

Cqf PUTATIVE EFFECTOR GENE FAMILIES— DISTRIBUTION AND 
EXPANSION 

Family 5485 is the largest family represented in the predicted 
Cqf secretome. The family contains 12 orthologous proteins 
(7 Cqf, 4 Mlp, and 1 Pgt proteins). Eleven of the 12 proteins 
have predicted N-terminal signal peptides (SignalP 4.0; Petersen
et al., 2011), and all seven members from Cqf are annotated as
belonging to the Cqf secretome. Domain architecture and conservation
data for Family 5485 proteins help to predict their
biological functions and putative roles in establishing infection.
Additionally, 11 of the 12 proteins in this family contained three
multicopper oxidase (MCO) domains and the remaining protein
(Pgt_20719) contained two of the three domains (Figure 7A).
The InterPro domains identified include: Cupredoxin domain
(IPR008972), Multicopper Oxidase, Type 1 (IPR001117),
Multicopper Oxidase, Type 2 (IPR011706), and Multicopper
Oxidase, Type 3 domain (IPR011707) (Figure 7A). A Copper-Binding
Site domain (IPR002355) was identified in only three Cqf
family members. These three proteins have a distinct phylogenetic 
history from other family members (Figure 7B). Generally, the 
FIGURE 3 | Gene loss and gain in the Basidiomycota fungal
lineage highlights shared loss in the rust pathogen lineage and
high levels of species-specific gain. Mapping putative gene
duplications and losses across 16 fungal taxa. Values in blue are
associated with gains/duplications, whereas orange indicates loss.
Outside of parentheses are the number of gain or loss events that
have occurred on the branch preceding a node, and within
parentheses are the number of gene families associated with
duplications or losses. The node denoted with R indicates the last
common ancestor of the rust pathogens.



phylogenetic relationships, as well as the genomic colocalization
of the proteins in this family, mirror the domain architecture,
providing insight into how these proteins evolved (Figure 7A).
Several additional families of MCOs are present in the Cqf
genome (i.e., Families 5853, 1542, and 1053); however, by definition,
Family 5485 has a distinct evolutionary history from
other families, as evidenced by distinct family placement by
OrthoMCL.



FIGURE 4 | The proteome size is positively correlated with the number
of gene family duplications across taxa. The linear relationship
(R² = 0.93) between the proteome size (i.e., protein count) and the number
of species-specific duplications detected in the analyses from Figure 1 is
depicted in this figure. Rust pathogens (Cqf, Mlp, and Pgt) are indicated
with circles and non-rust pathogens with squares. The line of best fit (black)
is indicated. Please refer to Methods for species abbreviations.



Family 9417 is the third largest family in the Cqf secretome, 
with all five of its proteins predicted as secreted. This family con- 
tains putative effectors likely involved in the establishment of 
disease, as all members have signal peptides, short lengths (average
207 aa), and high cysteine content (average 6.5%). All five
family members contain at least one common in fungal extracellular
membrane (CFEM) domain (InterPro IPR008427). Five additional
proteins encoded in the Cqf proteome contain CFEM domains.
Two of the five do not belong to a gene family, and the remaining
three proteins were each ascribed to different families containing
orthologs from multiple fungal taxa, unlike Cqf-specific family
9417. Similar to Family 5485, proteins of Family 9417 also colocal- 
ize in the genome, as three members are located on scaffold 43 of 
the Cqf assembly and the remaining two proteins are adjacent to 
one another on scaffold 5 (Figure 8). Protein members of Family 
9417 adhere to consensus domain structure and subcellular tar- 
geting of previously identified CFEM proteins. Online prediction 
algorithms detected transmembrane domain regions (TMpred;
Hofmann, 1993) and glycosylphosphatidylinositol (GPI) anchor
sites (big-PI Predictor; Nielsen et al., 1997) in a subset of the family
proteins (Figure 8). All proteins, excluding Cqf91696, were
predicted to have N-terminal transmembrane helices spanning
amino acids 3-23 for both proteins. Two members, Cqf712797
and Cqf651034, had C-terminal GPI anchor sites at amino acids
223 and 302, respectively (p-values 1.25E-04 and 2.10E-04). Only 
Cqf91696 had no bioinformatic evidence of association with the 
fungal membrane. 

DISCUSSION 

This study provides the first detailed analysis of the secretome 
of the fusiform rust pathogen, Cqf, since the recent assembly 




FIGURE 5 | Lack of relationship between proteome size and gene
family losses across taxa. The relationship (R² = 0.06) between the
proteome size (i.e., protein count) and the number of species-specific
losses detected in the analyses from Figure 1 is depicted in this figure. The
line of best fit (black) and fit mean (red) are also shown. Rust pathogens
(Cqf, Mlp, and Pgt) are indicated with circles and non-rust pathogens with
squares. Please refer to Methods for species abbreviations.



and annotation of a draft reference genome (unpublished, 
jgi.doe.gov/Cronartium). Additional criteria used in isolating
putative effectors from within the Cqf genome and its corresponding
secretome included rust pathogen-specific
and Cqf-specific gene family membership. Following
gene family construction, we highlighted putative effectors with
paralogs (within Cqf) or orthologs/paralogs (between Cqf and
other taxa) within the Cqf secretome. Over half (51%) of proteins
considered to be effector-like (small, cysteine-rich, secreted
proteins) belong to gene families. This is comparable to results
found in the hemibiotrophic pathogens Phytophthora ramorum 
and P. sojae where 77% of their secretomes are found in multigene 
families (Tyler et al., 2006). These findings demonstrate the value 
of an evolutionary perspective for highlighting families harbor- 
ing putative Cqf effectors. Altogether, the large-scale comparative 
genomics analyses in this study help elucidate the unique patterns 
of evolution in a rust proteome and its associated secretome. 

PUTATIVE EFFECTOR FAMILIES 

With the completion of the Cqf draft genome, it is important to 
identify proteins, such as effectors, that may be involved in establishing
disease on oak and pine hosts. Based on the evolutionary
forces presumed to act on effectors, in combination with a trio of 
rust pathogen genomes facilitating comparative analyses, we can 
now do experiments not previously feasible. We suggest this is a 
reasonable approach to identifying putative effectors that com- 
plements more conventional methods. Previous studies searched 
for effectors within other systems based on the presence of a sig- 
nal peptide, cysteine richness, and short protein lengths (<300 
aa) (Joly et al., 2010; Cantu et al., 2011; Duplessis et al., 2011b;
FIGURE 6 | Families gained and lost in rust pathogens and Cqf have varying levels of taxonomic conservation. The proportion of gene families (y-axis)
that contain protein members from 1 to 16 fungal taxa (x-axis) among those that have either expanded or contracted in the rust fungi or Cqf.



Table 3 | Cqf-specific duplicated gene families contain predominantly
uncharacterized proteins.

Gene family annotation                              Number of gene families (proteins)
No hits                                             183 (1510)
Unknown protein                                     57 (430)
20S Proteasome subunit alpha 6                      2 (9)
Zinc finger CCHC-type protein                       1 (33)
HIV-1 retropepsin, polyprotein                      1 (13)
Polysaccharide lyase family 4                       1 (7)
Reverse transcriptase                               1 (5)
MFS transporter, inorganic phosphate transporter    1 (5)
CFEM domain containing protein                      1 (5)

Functional annotation of the 248 gene families duplicated only in Cqf by BLASTp
against the non-redundant NCBI database (minimum e-values of 1e-10). A family
was ascribed a function if more than two proteins in the family received
the same top annotated BLASTp hit. The number of proteins within families
is indicated in parentheses.



Hacquard et al., 2012; Saunders et al., 2012). This study highlights
the usefulness of comparative genomic analyses for examining
the evolutionary history of each secretome member and shows that this
approach can be complemented with structural characteristics
of predicted secreted proteins. The rationale behind these
comparisons is that effector families conserved in rust fungi and 
unique to Cqf are candidates for conditioning rust pathogen and 
Cqf infection strategies, respectively. 
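
The annotation rule stated in the Table 3 legend (a family is ascribed a function when more than two of its proteins share the same top BLASTp hit at an e-value of 1e-10 or better) can be illustrated with a minimal Python sketch. The function name, the input layout, and the "Unassigned" label for families without a shared top hit are assumptions made for illustration; the legend does not say how such families were handled, and this is not the authors' pipeline.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-10  # threshold taken from the Table 3 legend
MIN_SHARED_HITS = 3     # "more than two proteins" sharing the same top hit


def annotate_family(top_hits):
    """Assign an annotation to one gene family.

    top_hits holds one entry per protein: a (description, e_value) tuple
    for the protein's top BLASTp hit, or None if the protein had no hit.
    This input layout is hypothetical.
    """
    counts = Counter()
    for hit in top_hits:
        if hit is None:
            continue
        description, e_value = hit
        # Keep only hits that pass the e-value cutoff.
        if e_value <= E_VALUE_CUTOFF:
            counts[description] += 1
    if not counts:
        return "No hits"
    description, n_proteins = counts.most_common(1)[0]
    # Assumption: families lacking a shared top hit are left unassigned.
    return description if n_proteins >= MIN_SHARED_HITS else "Unassigned"


example_family = [
    ("CFEM domain containing protein", 1e-30),
    ("CFEM domain containing protein", 3e-25),
    ("CFEM domain containing protein", 2e-18),
    ("hypothetical protein", 5e-12),
    None,
]
print(annotate_family(example_family))
```

In this made-up family, three of the five proteins share a top hit that passes the cutoff, so the family receives that annotation.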

We observed species-specific proteomic gene family
gains/duplications in the Cqf lineage, a subset of which represents
putative effectors. The paralogous (i.e., multi-copy) nature of
their protein family members indicates functional redundancy,
which is consistent with observations in other pathogenic fungi
(Kämper et al., 2006; Saitoh et al., 2012). We have identified two
lines of evidence
that point toward neo- and sub-functionalization in Cqf putative 
effector families. First, differential subcellular localization
predictions have been observed within putative effector families. In
about 34% of Cqf-specific families, only a subset of proteins is
predicted to be secreted from the fungal cell, while the remaining
family members are not predicted to be secreted and thus likely
remain within the fungal cell. This pattern suggests that secreted
proteins with effector function may have evolved from non-secreted
proteins without an effector function or vice versa. Second, changes
in the domain architecture of proteins within putative effector
families also point to neo- or subfunctionalization. For example,
Family 5485 contains MCO laccase-like enzymes and a single clade of
three Cqf proteins that have acquired an MCO copper-binding
site in the evolution of this family. It is possible that these
proteins have functions in Cqf that are novel or distinct from
those of their paralogs within the genome. This family is a strong
putative effector family because all Cqf members belong to the
predicted secretome and it has undergone Cqf-specific family
duplications.
Protein members within this family co-localize in the genome,
possibly as a result of tandem duplication through unequal
crossing over. Various functions have been ascribed to previously
identified fungal MCOs, including lignin degradation (Leonowicz
et al., 2001; Lundell et al., 2010), melanin synthesis (Langfelder
et al., 2003), fruiting body formation (Kües and Liu, 2000),
and pathogenicity on hosts (Zhu and Williamson, 2004). This
family has expanded in Cqf, the first sequenced rust pathogen 
that forms stem galls in woody tissues, and we hypothesize that 
these enzymes play a role in gall formation. The most common 
function for laccases/MCOs in basidiomycete fungi is lignin 
metabolism (Thurston, 1994; Kües and Rühl, 2011). However,
this gene family exhibits a lack of conservation with known
MCOs of lignin-degrading wood rots (P. chrysosporium and
S. lacrymans), which points to the possibility that these enzymes
may be involved in pathogenicity or may metabolize a plant
substrate other than lignin. On both hosts, Cqf infects primary tissue
that lacks high levels of lignification, such as spongy mesophyll
cells of oak leaves (Mims et al., 1996) and vascular cambium of
pine (Gray et al., 1982). If the Family 5485 enzymes are involved
in lignin degradation, the enzymatic activity may occur late in
gall development on the pine host, where the tissues are more
heavily lignified due to secondary wall formation. Though their
biochemical targets in planta are unknown, we hypothesize that
Family 5485 enzymes are secreted during infection and condition
the gall phenotype on the pine host. Further studies are required
to elucidate their true role in disease.

A second gene family that has expanded in the Cqf lineage
is Family 9417, which includes five Cqf-specific paralogs
that co-localize in the genome. Similar to Family 5485,
differential domain architecture within this family implies that neo- or
subfunctionalization may have occurred. Family 9417 contains
putative effectors that harbor conserved, fungal-specific CFEM
domains. These domains exhibit a characteristic cysteine
distribution and have a broad taxonomic conservation in fungi (Kulkarni
et al., 2003; Martin et al., 2008; Perez et al., 2011). Predicted
functions of proteins harboring CFEM domains include critical roles
in appressorial development (Choi and Dean, 1997; DeZwaan
et al., 1999) and activity as signal transducers, adhesion molecules,
and cell-surface receptors (Kulkarni et al., 2003). In contrast to
Family 5485 proteins, which may interact with the host during
infection, the molecular target for Family 9417 proteins could be
fungal. We hypothesize that these proteins are secreted and may play
roles during infection of the host.

FIGURE 7 | Domain structure, phylogenetic relationships, and
colocalization of copper oxidases. (A) Phylogenetic relationship
estimated with TreeFix (Wu et al., 2013) for the 12 members of Family
5485. Tree reconstructed with Tree of Life viewer (Letunic and Bork, 2011).
See methods for additional details for gene tree estimation. Branch lengths
are not informative for this tree. Domain architectures are indicated for each
protein: N-terminal signal peptides per SignalP 4.0 (blue; Petersen et al.,
2011); (MCu1: IPR001117) Multicopper oxidase Type 1 (green);
(MCu2: IPR011706) Multicopper oxidase Type 2 (orange);
(MCu3: IPR011707) Multicopper oxidase Type 3 (purple); and
(CuBS: IPR002355) Multicopper oxidase, copper-binding site as
determined by InterPro (red; Hunter et al., 2012). The colocalization of
proteins on scaffolds is indicated by branches sharing identical colors.
Thin black branches do not colocalize. (B) Co-localization of proteins within
MCO Family 5485 on scaffolds 7, 91, and 114. Family 5485 members are
denoted with the ID above each gene model. Secretome members are blue
arrows and non-secreted proteins are orange arrows. Gene orientation on
the scaffold is indicated with arrows. Note: gene lengths are not to scale.

FIGURE 8 | Family 9417 proteins harbor both signal peptides and
CFEM domains and colocalize in the Cqf genome. Domain
architecture of Family 9417 protein members where signal peptides (blue),
CFEM InterPro domains (green), GPI anchor sites (orange), and predicted
transmembrane domains (red) are indicated for each protein. Three proteins
colocalize on scaffold 43 and the remaining proteins on scaffold 5
(indicated on right).



EVOLUTION OF GENE GAIN AND LOSS 

Patterns of gene family loss and gain for rust fungi highlight 
major shifts in their proteomes, possibly associated with the rust 
pathogen's obligate biotrophic lifestyle. The origin of the rust 
pathogen clade is associated with nearly five times more losses, or 
family contractions, than duplications. There are many possible 
mechanisms for gene loss in rust fungi. For this reason, further 
investigations are required to both identify specific mechanisms 
and quantify their levels of effects on gene family evolution in 
rust fungi. However, we hypothesize that, since obligate
biotrophy has evolved multiple times in fungi (Spanu, 2012), the skew
toward gene loss in the rust pathogen lineage might be
associated with the shift from the life history of its ancestral state
to that of the obligate biotrophic pathogens we observe today.
These lost and/or contracted families exhibit broad taxonomic
conservation and may have been constituents of the ancestral
"core" fungal gene set, suggesting that they are unnecessary for
obligate biotrophs but may be necessary for free-living and
symbiotic species. For example, enzymes integral to the sulfur and
nitrogen assimilation pathways are missing in Cqf (unpublished,
jgi.doe.gov/Cronartium), Mlp, and Pgt (Duplessis et al., 2011b).
This also suggests that the evolution of obligate biotrophy drives
toward an irreversible life history shift (Spanu, 2012).

Although the rusts have undergone considerable gene family
losses and contractions, they exhibit some of the largest
proteomes in fungi. Much of their proteome size appears
to be due to species-specific duplications. Nearly one-third
(32.1%) of all observed duplications across all of the sampled
basidiomycete fungi are rust taxon-specific duplications.
The high levels of species-specific duplication yield
disproportionately greater numbers of newly-evolved genes in the rust
pathogen genomes compared to ancient or conserved genes
(genes shared with older lineages) in each proteome. The presence
of so many species-specific duplications suggests that
the rusts have highly labile genomes. This is consistent with
the large (>10%) genomic size variation detected in progeny
from a single Cqf cross relative to parental isolates (Anderson
et al., 2010). Such rapid changes, occurring in the span of
a single generation, could facilitate the gene gains and losses
observed in our analyses. The close association with hosts
may foster a labile and diverse genome, enabling the parasites
to rapidly adapt to the continually evolving host resistance
pathways.
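
The share of taxon-specific duplications discussed above is a ratio of duplication events on terminal branches of the focal taxa to all duplication events on the species tree. The following Python sketch shows that arithmetic under stated assumptions: the function name, the branch labels, and the counts are hypothetical placeholders, not values or code from this study.

```python
def taxon_specific_share(dup_counts, focal_taxa):
    """Fraction of all inferred duplications that fall on focal terminal branches.

    dup_counts maps a species-tree branch label to the number of
    duplication events inferred on that branch (hypothetical layout).
    focal_taxa is the set of terminal-branch labels of interest.
    """
    total = sum(dup_counts.values())
    if total == 0:
        raise ValueError("no duplication events to summarize")
    focal = sum(n for branch, n in dup_counts.items() if branch in focal_taxa)
    return focal / total


# Made-up counts for illustration only; these are not the study's data.
illustrative_counts = {
    "Cqf": 30,
    "Mlp": 20,
    "Pgt": 10,
    "rust ancestor": 15,
    "other basidiomycete branches": 125,
}
share = taxon_specific_share(illustrative_counts, {"Cqf", "Mlp", "Pgt"})
print(f"{share:.1%} of duplications are rust taxon-specific")
```

Duplications on the internal branch leading to all rusts are counted in the denominator only, which matches the distinction drawn in the text between duplications at the origin of the rust clade and those specific to each sampled rust genome.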

COMPARATIVE ANALYSIS AND GENETIC MAPPING TO VALIDATE 
PUTATIVE EFFECTORS 

Further characterization of putative effectors in Cqf could be
accomplished with analysis of selection potentially arising from
host resistance mechanisms (Allen et al., 2004; Aguileta et al.,
2009; Barrett et al., 2009; Thrall et al., 2012). In addition,
expression analysis can be informative, since secreted proteins with
specific expression profiles during infection are stronger effector
candidates (Ellis et al., 2009). Time-course experiments have
been successful in other rust pathogen systems to elucidate the
effector-like proteins involved in multiple or highly specific stages
during infection (Joly et al., 2010; Duplessis et al., 2011a; Bruce
et al., 2014). Also, resequencing of closely related rust pathogens
such as Cronartium ribicola, C. flaccidum, and Peridermium
harknessii (Vogler and Bruns, 1998) would improve precision of gene
family delineations and identification of true singleton Cqf
effectors, which are likely to be more newly evolved than effectors in
families, and may therefore be products of highly-specific
host-Cqf coevolution. Finally, a subset of the predicted effectors are
likely avirulence proteins, which are, by definition, involved in
genotype-specific "gene-for-gene" interactions with hosts. These putative
avirulence effectors can be validated through genetic mapping
to their corresponding host resistance genes, an approach that
has previously been successful in identifying the first avirulence
protein locus in Cqf (Kubisiak et al., 2011). Altogether, these
validation approaches will yield true members of the Cqf secretome
and provide additional insight into the biological functions for
effectors infecting oak and pine.